It’s important to think about what you expect data to look like before you start collecting it (or analysing it, if you’re using a secondary dataset). Let’s start with sketching.
Think about the grid task: there is a 4x4 grid, and on each trial 4 cells light up. The participant gets rewarded for selecting any four cells that aren’t the ones that lit up. Here’s what we’re measuring, in vague terms:
Using the Miro board at your table, sketch with your group what you think the results will look like under:
- conditions where this innovation task does not lead to CCE, versus
- conditions where the innovation task does lead to CCE
In either case, you might have ideas about how different learner types perform differently; discuss this and try to integrate it into your sketches. Once you have some final sketches, take some screenshots of these. You’ll use these to compare with some real data as part of the next exercise.
That first exercise was designed to get you thinking about what you expect data to look like. Sketching is useful at this stage because you might not have access to the data yet, but it also sets up your hypotheses in a concrete visual way. This starts to form part of a pipeline for your work: you already know how you’ll need to visualise your data to check your hypotheses before you even have it.
Based on your sketches you may have come up with a variety of ways you’d like to look at the data. In this exercise, we’ll look at actual data from this task from Saldana et al (2019), which tested this grid/innovation task in both children and baboons. While there are many ways you could map your variables visually, we’ll choose one to move forward with. A bit later, we’ll discuss in more detail how to make decisions about mapping variables visually.
Let’s start by looking at the raw data that was provided on OSF from Saldana et al (2019). I highly encourage you to look at the whole paper in detail eventually (if you haven’t already), but for now, this exercise might be more useful if you deal with the data a bit more naively.
The data come in two separate data files, dataBaboons.csv and dataChildren.csv, which are both in the data folder. As noted in the brief lecture, these are both in long format: this means that every line is a single response. The first step is to read the data into variables:
children<-read_csv("data/dataChildren.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_character(),
## GridSeen = col_double(),
## GridDone = col_double(),
## Generation = col_double(),
## TrialNb = col_double(),
## ChainNb = col_double(),
## BinTetroDone = col_double(),
## BinTetroSeen = col_double(),
## Score = col_double(),
## SymmetryBin = col_double(),
## TrialName = col_double()
## )
## ℹ Use `spec()` for the full column specifications.
#uncomment below and run the chunk to look at the head of the file; this will display the first 6 rows and give you a sense of what the columns are
head(children)
## # A tibble: 6 x 20
## GridSeen GridDone Generation PartID PrevPartID TrialNb DateTime ChainNb
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1158 1678 1 A_01 A_seed 1 Wed Jul 11 15:… 1
## 2 1274 1624 1 A_01 A_seed 2 Wed Jul 11 15:… 1
## 3 1467 841 1 A_01 A_seed 3 Wed Jul 11 15:… 1
## 4 207 479 1 A_01 A_seed 4 Wed Jul 11 15:… 1
## 5 1633 9 1 A_01 A_seed 5 Wed Jul 11 15:… 1
## 6 615 1811 1 A_01 A_seed 6 Wed Jul 11 15:… 1
## # … with 12 more variables: TetrominoDone <chr>, BinTetroDone <dbl>,
## # TopBottomDone <chr>, RightLeftDone <chr>, TetrominoSeen <chr>,
## # BinTetroSeen <dbl>, TopBottomSeen <chr>, RightLeftSeen <chr>, Score <dbl>,
## # Symmetry <chr>, SymmetryBin <dbl>, TrialName <dbl>
After looking at the column headers, think about which columns we can dispense with for this particular visualisation. We’ll be making a copy of the dataset with particular columns, so you’re not getting rid of any data. You may not always want to do this, but it can make things a bit easier, and in this case, it’s necessary so that we can put the child and baboon data together to visualise them side by side (we need each of the datasets to have identical column headers in order to do this). For starters, we’ll deal only with generation, score, and proportion of tetrominoes.
What columns do we want to keep from the data file?
- Generation
- Score
- BinTetroDone

We also want predictability and learner type, but those aren’t there yet - how we get these will become clear later. Before we get a leaner version of the child data, let’s read in the baboon data.
baboons<-read_csv("data/dataBaboons.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## Name = col_character(),
## Sex = col_character(),
## TestingPhase = col_character(),
## DateTime = col_datetime(format = ""),
## TetrominoDone = col_character(),
## TopBottomDone = col_character(),
## RightLeftDone = col_character(),
## TetrominoSeen = col_character(),
## TopBottomSeen = col_character(),
## RightLeftSeen = col_character(),
## Symmetry = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
#uncomment below and run the chunk to look at the head of the file; this will display the first 6 rows and give you a sense of what the columns are
head(baboons)
## # A tibble: 6 x 22
## GridSeen GridDone Name Sex Age TrialName Score TestingPhase TrialNumber
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 883 26 EWINE fema… 91 1 1 Test 1
## 2 1301 1230 EWINE fema… 91 1 0 Test 1
## 3 1043 602 VIOLET… fema… 146 1 1 Test 1
## 4 1280 1819 DAN male 107 1 0 Test 1
## 5 711 211 HARLEM male 55 1 0 Test 1
## 6 1622 1678 FANA fema… 84 1 0 Test 1
## # … with 13 more variables: Generation <dbl>, DateTime <dttm>, ChainNb <dbl>,
## # TetrominoDone <chr>, BinTetroDone <dbl>, TopBottomDone <chr>,
## # RightLeftDone <chr>, TetrominoSeen <chr>, BinTetroSeen <dbl>,
## # TopBottomSeen <chr>, RightLeftSeen <chr>, Symmetry <chr>, SymmetryBin <dbl>
Here, there’s an additional column we need to pay attention to: TestingPhase. The baboons had a control condition with no transmission: each baboon was given an entirely new random set of cells (rather than the one produced by the previous baboon), but with the same remit to innovate around the prompt rather than reproduce it. This is meant to test whether transmission plays an important role in a non-copying task. For now, we only care about the transmission condition - we’ll need to use the values in this column to eliminate the non-transmission condition.
First, we’re going to deal with performance and tetromino proportions; these are relatively simple for us to summarize because we can glom all trials from a given generation and learner type together. Calculating entropy requires a set of responses, so we have to first calculate a value for each participant, and then summarize across chains in a given generation (if you’re not familiar with entropy, read a bit more about it here - it’s a very useful descriptive measure which can be applied to many different kinds of data). Let’s deal with the simple case first.
Let’s start with the filter() function to deal with the TestingPhase issue in the baboon dataset. We are interested in the transmission condition; to isolate it, we need to understand exactly how it’s coded in the data. Run the chunk below to find out.
unique(baboons$TestingPhase)
## [1] "Test" "Random"
This shows us that the TestingPhase column contains the values “Test” and “Random”. Notice how we can use $ to select a particular column from a dataframe. Based on the values coded in the data, replace VALUE HERE with the correct value below to filter out the trials we aren’t interested in.
baboonsTest<-baboons %>%
filter(TestingPhase=="Test")
#replace 'VALUE HERE' with the value you want to keep
#If you don't replace this, what do you think will happen?
#also note that you can use != with the value you don't want, which is standard notation for "does not equal"
There are a couple of things to note about this code chunk. The first is that we’re making a new dataframe, baboonsTest, which starts with a copy of baboons that we’re modifying in a particular way. We’ll never be using the full baboons dataset in this exercise, but in general (assuming you don’t have storage or memory limitations given a very large dataset), making copies allows you to backtrack and preserve things where necessary.
The next thing you’ll notice is the %>%, known in the tidyverse as a “pipe”. Pipes allow us to chain different commands together to do lots of things at once in a transparent way. Once we’ve gone through our summary step by step, we’ll look at how to do it all in one go using these kinds of pipes.
Finally, note that while variable names in the tidyverse can generally be written without double quotes, variable values need to be within quotes if they are strings. So while TestingPhase doesn’t require double quotes, the value we’re selecting does need to be in double quotes. Also note that R can only find TestingPhase because it already knows (from the line prior) that we’re dealing with the baboons data frame.
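To see how piping, quoting, and the != alternative fit together, here’s a minimal sketch using a toy stand-in for the baboon data (the values are made up for illustration):

```r
library(dplyr)

# a toy stand-in for the baboon data (hypothetical values)
toy <- tibble(
  TestingPhase = c("Test", "Random", "Test"),
  Score        = c(1, 0, 1)
)

# keeping the value we want...
keepTest <- toy %>% filter(TestingPhase == "Test")

# ...is equivalent here to dropping the value we don't want with !=,
# since TestingPhase only takes these two values
dropRandom <- toy %>% filter(TestingPhase != "Random")
```

Both approaches give the same two rows in this case; note that the string values need quotes, while the column name TestingPhase does not.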
Now we have a version of baboons, baboonsTest, which isolates the test trials. Next, let’s isolate the variables we’re interested in - we’ll start by only looking at performance and tetromino proportions over generational “time”. Enter the variable (column) names we want to select as arguments to the select() function below. Note that we need to overwrite baboonsTest with a new version to integrate the select command.
baboonsTest<-baboonsTest %>%
select(Generation,Score,BinTetroDone)
Now, do the same for the children, only we don’t need anything but the select() command because there is only one condition in this data frame. However, we do need to copy the original because we haven’t done that yet.
#copy the children data frame into a new data frame, childVis, and use a pipe to select only the columns we really need
childVis<-children %>%
select(Generation,Score, BinTetroDone)
Now we have each of the datasets separately, but what we really want is to have the child data and baboon data together, so we can compare them visually. For this, we want to use rbind() which stands for “row bind” to basically stick these two data frames together. rbind() requires dataframes with identical column names - uncomment and run the chunk below and note how the rbind call throws an error.
#rbind(children, baboons)
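Here’s a minimal self-contained illustration of the same requirement, using two tiny made-up data frames:

```r
# rbind() needs identical column names: a minimal illustration
a <- data.frame(x = 1, y = 2)
b <- data.frame(x = 3, z = 4)  # second column named differently

# this errors ("names do not match previous names"):
res <- try(rbind(a, b), silent = TRUE)

# with matching names it works fine:
ok <- rbind(a, data.frame(x = 5, y = 6))
```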
You get the idea. But since we’ve just gone to great lengths to select identical columns from our dataframes, we can just get on with it, right? Well, not quite. Think about why that might not be a good idea for a second before continuing on…
If we throw the frames together now, something will be missing: which rows are children and which rows are baboons? This information wasn’t in the original dataframes, quite reasonably, because it wouldn’t have varied between any of the rows in any way. But now that we’re glomming them together, we need to add it, starting with the baboon dataset. Do the same for the child one, and then bind them together into allData.
baboonsTest<-baboonsTest %>%
add_column(LearnerType="Baboon")
baboonsTest
## # A tibble: 4,500 x 4
## Generation Score BinTetroDone LearnerType
## <dbl> <dbl> <dbl> <chr>
## 1 1 1 1 Baboon
## 2 1 0 0 Baboon
## 3 1 1 1 Baboon
## 4 1 0 1 Baboon
## 5 1 0 0 Baboon
## 6 1 0 1 Baboon
## 7 1 0 1 Baboon
## 8 1 1 0 Baboon
## 9 1 1 1 Baboon
## 10 1 0 0 Baboon
## # … with 4,490 more rows
#Now uncomment below to do the same for the childVis dataframe
childVis<-childVis %>%
add_column(LearnerType="Child")
childVis
## # A tibble: 1,800 x 4
## Generation Score BinTetroDone LearnerType
## <dbl> <dbl> <dbl> <chr>
## 1 1 1 1 Child
## 2 1 1 1 Child
## 3 1 1 1 Child
## 4 1 1 1 Child
## 5 1 1 0 Child
## 6 1 1 1 Child
## 7 1 0 1 Child
## 8 1 0 1 Child
## 9 1 1 1 Child
## 10 1 1 1 Child
## # … with 1,790 more rows
#Complete the rbind call to add them together
allData<-rbind(baboonsTest,childVis)
Below, check out how all of this could be done using pipes, which allow us to chain commands together in a much cleaner bit of code.
childVis<-children %>%
select(Generation, Score, BinTetroDone) %>%
add_column(LearnerType="Child")
baboonsTest<-baboons %>%
filter(TestingPhase=="Test") %>%
select(Generation, Score, BinTetroDone) %>%
add_column(LearnerType="Baboon")
alldat<-rbind(baboonsTest,childVis)
Now that we have all the data together, we’re ready to summarise and look at our variables. Note that depending on the kind of data you have, it might not make sense to do summary statistics at this point. For example, if you have truly continuous variables, e.g., amount of food eaten (weight in g) and temperature, you might want to start with a scatterplot of the raw data rather than calculating means. However, in this case, we have binary outcome variables (success at innovation or not, produced a tetromino or not). Plotting these without summarizing them would be visually useless. Take a quick look at the plot below (by executing the code chunk) to see why.
pointscore<-ggplot(data=alldat,aes(x=Generation, y=Score))+
geom_point()
pointscore
All of the values are at 0 or 1, and ggplot is simply drawing them on top of each other, so we really can’t see much of anything. Note that this is because when we code categorical data, we’ve done something kind of sneaky and assigned it numbers (0 or 1) - if you look carefully at the information in the head() call for the original data frame (or the column specifications spat out by read_csv()), you’ll notice that Score is a double. This means that R has assumed this is a float/decimal type number. This is actually deliberate on the part of the authors, because this categorical variable needs to be used to calculate proportions, but it’s caused ggplot to make an incorrect assumption: that values between 0 and 1 are possible. We’re going to calculate proportions that will give us these kinds of values in the end, but remember that even though we might visualise this as a continuous variable, it isn’t one for analysis purposes (i.e., you would need to use logistic rather than linear regression).
Before we make this more useful to look at, let’s use the code for this relatively useless graph to start to understand how ggplot works.
pointscore<-ggplot(data=alldat,aes(x=Generation, y=Score))+
geom_point()
pointscore
First, we’re putting our plot into a variable called pointscore. You don’t have to do this - you could just call ggplot(...)+ and it would spit out a plot. You might sometimes prefer this if you’re playing around with looking at some data. But generally, assigning your plot to a variable is cleaner and allows us to play with some aesthetics later without actually changing the original plot, e.g., I can test what happens if I add lines, while still preserving the original graph.
pointscore+geom_line()
pointscore
Adding lines at this point doesn’t make this any less visually meaningless - but let’s push on with understanding some basics of ggplot before we tidy it up.
Inside the initial call to ggplot() we need two things: a data frame that we want ggplot to draw (data = alldat), and the specific aesthetics we want drawn (aes(x=Generation, y=Score)). On top of this, we layer our desired geom(s) - which is essentially the kind of plot we want: we started with points and then added lines. If you don’t add any geoms, ggplot will give you a blank plot - you’ve told it what to plot, but not how to plot it:
ggplot(data=alldat,aes(x=Generation, y=Score))
Layering geoms works how you would expect: the lines will be drawn on top of the points because we called geom_line() after calling geom_point(). You can do this in lots of different ways, including leaving the dataset and aesthetics out of the main ggplot call and instead specifying it inside the geom. The code below is identical to our initial plot:
pointscore<-ggplot()+
geom_point(data=alldat,aes(x=Generation, y=Score))
#pointscore
We’ve gone to the trouble of tidying our data into a single dataframe, so going forward we’ll specify the data in the ggplot call (and this is generally the preferred method). However, I point this out because it’s worth noting that you can visualise multiple datasets on the same plot by using the layering afforded by passing the data/aesthetics to the geoms.
In general, don’t forget your geoms, and make sure the way you’re drawing your data lines up with the kind of data you have: ggplot will not prevent us from plotting a binary outcome variable as a scatterplot using geom_point() (because it interprets it as a double) even though it’s useless. We’ll talk more about how to make good visualisation decisions later on.
However, for now, we need some kind of summary of these binary outcome variables in order to visualise what’s happening. Are scores changing over generations? Are children and baboons performing differently? To see this, we need to calculate the proportion of correctly innovated trials per generation; since the data was coded as 0 for a missed trial and 1 for a successful one (which is a generally useful convention), we can use a simple mean to do this, creating a new variable called meanScore using the dplyr summarise() function.
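A quick aside on why a mean works here: with 0/1 coding, the mean of the outcomes is exactly the proportion of successes, as this toy example (hypothetical trial outcomes) shows:

```r
# with 0/1 coding, the mean is exactly the proportion of successful trials
score <- c(1, 0, 1, 1)            # hypothetical trial outcomes
mean(score)                       # 3 successes out of 4 trials
sum(score == 1) / length(score)   # the same proportion, spelled out
```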
performance<-alldat %>%
summarise(meanScore=mean(Score))
head(performance)
## # A tibble: 1 x 1
## meanScore
## <dbl>
## 1 0.861
This is great and all - we can see that the overall performance across children and baboons was about 86% - but we’ve lost a bunch of information. We need to add the use of group_by() with our independent variables.
Add the group_by() function below with the IVs of interest, and use the MeanSE() function to add the standard error (you can add as many new variables to summarise() as you like). Add the relevant bits to the code chunk below, uncomment, and run the code. Look at the head of the new performance data frame and note the difference (also note that this is overwriting the previous value of the variable performance):
performance<-alldat %>%
group_by(Generation, LearnerType) %>% #add the IVs of interest here
summarise(meanScore=mean(Score),se=MeanSE(Score)) #use MeanSE() to add the standard error
## `summarise()` has grouped output by 'Generation'. You can override using the `.groups` argument.
head(performance)
## # A tibble: 6 x 4
## # Groups: Generation [3]
## Generation LearnerType meanScore se
## <dbl> <chr> <dbl> <dbl>
## 1 1 Baboon 0.711 0.0214
## 2 1 Child 0.861 0.0258
## 3 2 Baboon 0.818 0.0182
## 4 2 Child 0.944 0.0171
## 5 3 Baboon 0.771 0.0198
## 6 3 Child 0.917 0.0207
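Two asides about this chunk. MeanSE() comes from the DescTools package (an assumption based on where that function usually lives); under the hood, the standard error of the mean is just sd(x)/sqrt(n), which you can write yourself. And the “grouped output” message is informational; summarise() takes a .groups argument if you want to silence it. A sketch with a toy data frame:

```r
library(dplyr)

# the standard error of the mean, written by hand; DescTools' MeanSE()
# computes the same quantity
se <- function(x) sd(x) / sqrt(length(x))

# hypothetical mini-version of the combined data
toy <- tibble(
  Generation = c(1, 1, 2, 2),
  Score      = c(1, 0, 1, 1)
)

toy %>%
  group_by(Generation) %>%
  summarise(meanScore = mean(Score),
            se        = se(Score),
            .groups   = "drop")  # also silences the "grouped output" message
```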
Now that this has the relevant information we need, we can plot the means over time:
scores<-ggplot(data=performance, aes(x=Generation, y=meanScore))+
geom_point()
scores
Looks much better, and we can see kind of an upward trend (?), but we’re still missing some information: we can’t tell which points are children and which are baboons. We need to add an aesthetic for the LearnerType - map LearnerType to the aesthetic colour.
#add colour in the list of aesthetics (aes), and map it to LearnerType
scores<-ggplot(data=performance, aes(x=Generation, y=meanScore,colour=LearnerType))+
geom_point()
scores
Now we’re getting somewhere. See if you can add some stuff on your own:
Add a line, errorbars, and a representation of chance performance to the performance summary plot - the docs will be helpful here, but also try the best way to troubleshoot: Google “errorbars in ggplot” and find your own resource.
scores<-scores+
geom_line()+
geom_errorbar(aes(ymin=meanScore-se,ymax=meanScore+se))+
geom_hline(aes(yintercept=0.65))
scores
Now do the same with the BinTetroDone variable (summarising into a new data frame, followed by plotting) to graph the proportion of tetrominoes produced over time in each group.
tetrominoProp<-alldat %>%
group_by(Generation, LearnerType) %>%
summarise(tetProp=mean(BinTetroDone),se=MeanSE(BinTetroDone))
## `summarise()` has grouped output by 'Generation'. You can override using the `.groups` argument.
tetrominoes<-ggplot(data=tetrominoProp,aes(x=Generation,y=tetProp,colour=LearnerType))+
geom_point()+
geom_line()+
geom_errorbar(aes(ymin=tetProp-se,ymax=tetProp+se))+
geom_hline(aes(yintercept=0.005))
tetrominoes
So far we’ve been dealing with variables that are pretty straightforward: for each trial we had a binary outcome (did they succeed at the trial or not, did they create a tetromino or not) that we used to create a proportion (by averaging the binary outcomes across both trials and participants for each generation). However, the “predictability” of the cells selected is a more complex variable. For this, we have to calculate the entropy of the set of values of the GridDone variable for each participant.
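If entropy is new to you, it may help to see it computed by hand before using the Entropy() function: Shannon entropy is H = -Σ p·log(p) over the relative frequencies p of each unique value. A minimal base-R sketch (the shannon name and example values are made up):

```r
# Shannon entropy of a set of responses, with natural log (base = exp(1)),
# matching the original paper
shannon <- function(x) {
  p <- table(x) / length(x)   # relative frequency of each unique response
  -sum(p * log(p))
}

shannon(c("a", "a", "b", "b"))  # two equally likely values: log(2)
shannon(rep("a", 4))            # a single repeated value: 0, fully predictable
```

High entropy means responses are spread across many grid positions (unpredictable); low entropy means they’re concentrated on a few (predictable).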
Do your best to translate the steps below into code - feel free to ask questions if you get stuck!
Wrap the GridDone variable in R’s table() function, which automatically calculates a frequency table, and use base=exp(1) as an argument to the Entropy() function to get the exact results from the original paper.
childEnt<-children %>%
select(Generation, ChainNb, GridDone) %>%
add_column(LearnerType="Child")
baboonEnt<-baboons %>%
filter(TestingPhase=="Test") %>%
select(Generation, ChainNb,GridDone) %>%
add_column(LearnerType="Baboon")
allEnt<-rbind(childEnt,baboonEnt)
sumEnt<-allEnt %>%
group_by(Generation, ChainNb, LearnerType) %>%
summarise(Entropy=Entropy(table(GridDone),base=exp(1)))
## `summarise()` has grouped output by 'Generation', 'ChainNb'. You can override using the `.groups` argument.
meanEnt<-sumEnt %>%
group_by(Generation, LearnerType) %>%
summarise(meanEntropy=mean(Entropy),se=MeanSE(Entropy))
## `summarise()` has grouped output by 'Generation'. You can override using the `.groups` argument.
predictability<-ggplot(data=meanEnt, aes(x=Generation, y=meanEntropy, colour=LearnerType))+
geom_point()+
geom_line()+
geom_errorbar(aes(ymin=meanEntropy-se,ymax=meanEntropy+se))
predictability
Finally, let’s compare the plots we made to our original expectations, and also to the actual plots from the published paper.
- Start by uploading screenshots of your sketches into the project
- Use ggsave() to save images of your plots to display next to the other plots (the fact that your final plots are saved to variable names should come in handy)
- Plots from the actual publication are already in data/publishedPlots/ within the project
- Fill in the correct paths/filenames and execute the code chunks below to compare.
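Here’s a minimal ggsave() sketch; the filename and dimensions are hypothetical, and the toy plot p stands in for your saved plot variables (scores, tetrominoes, predictability):

```r
library(ggplot2)

# a toy plot standing in for one of your saved plot variables
p <- ggplot(data.frame(x = 1:3, y = 1:3), aes(x, y)) +
  geom_point()

# by default ggsave() saves the last plot drawn; passing plot= explicitly
# is safer, and the file extension (.png here) determines the format
ggsave("myPlot.png", plot = p, width = 6, height = 4)
```

For the real thing you’d write something like ggsave("scores.png", plot = scores, ...) for each of your three plots.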
Comparing Scores Visually
Comparing Tetromino Proportions Visually
Comparing Entropy Visually
In the second half of the workshop, we’ll work on tidying these up and making camera-ready visualisations.
In the last session we made decent progress with using R to make basic visualisations and compare those to our sketched expectations. However, we ended on a comparison which included some major differences between our graphs and the publication-ready ones: ours generally didn’t look as nice, and the ggplot() defaults didn’t necessarily represent our data in the best way. In this session, we’ll talk briefly about good practice in data visualisation, before moving on to how to tidy things up to make our plots look better.
Finally, I’ll direct you to some resources on network data visualisation, which is considerably more complex, and time permitting, we’ll start to fiddle with some network data using the igraph and ggnetwork packages.
We ended the last session by thinking briefly about the differences between our graphs and the graphs that were published of the baboon and child cultural transmission data. The major differences are:
First, we’ll deal with the axes, then the colour, and then the general visual feel of the graph, which is also known as the theme.
Labelling your axes is simple in ggplot using xlab() and ylab().
scores<-scores+ylab("Proportion Successful Trials")
scores
Using this as a template, add labels to the axes for the tetromino and predictability graphs.
tetrominoes<-tetrominoes+ylab("Proportion Tetromino Responses")
predictability<-predictability+ylab("Mean Entropy of Response Sets")
We might also want to label our LearnerType legend more accurately. While we generally don’t want spaces in our variable names within dataframes, we probably do want them for readability in graph labels. We’ll learn how to change this when we deal more generally with scales later on.
First, let’s add a label to the line that denotes random performance (and make it dashed so it’s a bit more obvious it’s not just the bottom of the graph). Look at the documentation for annotate(). Add a label near the random performance line; I’ve already made the line dashed, but note that this has required re-doing the entire ggplot call rather than just adding to the existing plot. Why do you think this is necessary in this case?
scores<-ggplot(data=performance, aes(x=Generation, y=meanScore,colour=LearnerType))+
geom_hline(yintercept=0.65, linetype="dashed")+
geom_line()+
geom_errorbar(aes(ymin=meanScore-se,ymax=meanScore+se))+
ylab("Proportion Successful Trials")+
annotate("text",x=2,y=0.65,label="Chance\nPerformance")
scores
Finally, we can change the limits of the axes - per the dos and don’ts of good datavis, we might want to show the entire range of the variable on the y-axis. We can do this easily with ylim():
scores+ylim(0,1)
This certainly changes how our results look. The rise in performance now doesn’t look as stark as it did before. However, you’ll notice I used scores+ylim(0,1) rather than redefining the variable scores as scores<-scores+ylim(0,1). This is because while this is illustrative of the ylim() argument (and there’s a corresponding xlim()), showing the whole range doesn’t make much sense in this case. The relevant benchmark for performance in this task is the random chance line we’ve added, not 0; proportions closer to 0 would indicate a bias for actually copying grid cells despite the constraints of the task disfavouring this. In other words, it’s not especially surprising that values down near zero (even below 0.5) aren’t in our data, so not showing them is fine.
The fact that truncating the y-axis in this case dovetails with making the results look better is a bonus. We’ve even added something the published graph is missing, which is a representation of random performance (the actual “floor” we’d expect in this data). Remember to be careful not to truncate the axes just because it makes the results look nice - make sure they are also actually nice relative to your expectations.
Once you get around to prettifying the tetromino graph, think about this issue again. The published graph here also has a very truncated y-axis - does it also make sense for that variable? Why or why not?
So far we’ve made some changes using ylab() and ylim(). These are handy for making quick changes only to these attributes - we might leave it at that if we didn’t have any other issues. However, our x-axis is doing something weird with where it breaks: at 2.5, 5.0, 7.5, and 10.0, because ggplot is interpreting Generation as a double. The value 2.5 isn’t possible for this variable, so the placement of the labels on the axis is a bit strange.
Enter scale_...() layers. We can add these to our plot to specify things like the limits, label, and breaks of a scale all at the same time. I’ll model this for the y-axis of the graph, making some changes we don’t necessarily need, but that might look good:
scores<-scores+
scale_y_continuous(limits=c(0.6,1), breaks=c(0.6,0.8,1.0),name="Proportion Successful Trials")
scores
You can see that I’ve specified the limits and the breaks I want by passing a vector using
c() - note that even if you want e.g., only a single break, you need to pass a vector. Now, apply these concepts to scale_x_continuous(), giving it a range between 0 and 10 with breaks at each generation. Rename it as “Generation (Time)” for funsies:
scores<-scores+
scale_x_continuous(limits=c(1,10), breaks=c(1,2,3,4,5,6,7,8,9,10), name="Generation (Time)")
scores
You’ll notice this has done something a little weird with the errorbars - they are cut off for the first and last generation. This is because ggplot is trying to use coordinates <1 and >10 to draw these lines out horizontally. We’ll deal with this later by changing
geom_errorbar() to geom_pointrange(), which removes the horizontal bars.
Now we have one more scale to deal with - we want a space in the LearnerType legend title. This is as simple as using the name argument in scale_colour_discrete(). Also change ‘Child’ to ‘Human Child’ using labels=c() - note that for discrete scales, ggplot will expect as many labels as there are values in the c() vector you pass. Also be mindful of the existing order of the values - ggplot will let you label them incorrectly. Try it:
scores<-scores+
scale_colour_discrete(name="Learner Type", labels=c("Baboon","Human Child"))
#below would throw an error because you've only passed one label
#scale_colour_discrete(name="Learner Type", labels=c("Human Child"))
#below would run perfectly, but now the baboon data would be labelled "Human Child" because you've reversed the order
#scale_colour_discrete(name="Learner Type", labels=c("Human Child","Baboon"))
scores
Note that we’ve used a slightly different description of the scale here, ending in _discrete rather than _continuous - this is because we’ve mapped colour to a discrete variable rather than a continuous one. Likewise, if you had a completely categorical x-axis (something like, e.g., nationality), you’d use scale_x_discrete() to specify the properties of the scale.
There are a few more tidbits we want to tidy up before we move on. First, let’s look at the plot call as a whole - we can lose sight of this using the scores+ method, and from now on we’ll have to edit the entire chunk of code because we’re changing values in particular existing functions, as opposed to adding entire functions. Normally, you’d do this in a single chunk and run it repeatedly as you make changes. However, we’ll do this sequentially as part of the exercise (though this means some ugly code repetition you wouldn’t normally have in your own pipeline).
scores<-ggplot(data=performance, aes(x=Generation, y=meanScore,colour=LearnerType))+
geom_hline(yintercept=0.65, linetype="dashed")+
geom_line()+
geom_errorbar(aes(ymin=meanScore-se,ymax=meanScore+se))+
annotate("text",label="Chance\nPerformance",x=2,y=0.65)+
scale_y_continuous(limits=c(0.6,1), breaks=c(0.6,0.8,1.0),name="Proportion Successful Trials")+
scale_x_continuous(limits=c(1,10), breaks=c(1,2,3,4,5,6,7,8,9,10), name="Generation (Time)")+
scale_colour_discrete(name="Learner Type", labels=c("Baboon","Human Child"))
scores
First, let’s deal with the errorbars that are getting cut off - change geom_errorbar to geom_pointrange and look at the difference.
scores<-ggplot(data=performance, aes(x=Generation, y=meanScore,colour=LearnerType))+
geom_hline(yintercept=0.65, linetype="dashed")+
geom_line()+
#change geom_errorbar to geom_pointrange
geom_pointrange(aes(ymin=meanScore-se,ymax=meanScore+se))+
annotate("text",label="Chance\nPerformance",x=2,y=0.65)+
scale_y_continuous(limits=c(0.6,1), breaks=c(0.6,0.8,1.0),name="Proportion Successful Trials")+
scale_x_continuous(limits=c(1,10), breaks=c(1,2,3,4,5,6,7,8,9,10), name="Generation (Time)")+
scale_colour_discrete(name="Learner Type", labels=c("Baboon","Human Child"))
scores
That already looks much cleaner, and it’s added points which are useful for seeing where the actual means are more clearly. There’s one final change that we can make to align the aesthetics of the data with the published graphs: they’ve used the shape of the points to contrast between the children and baboons. We can do the same - add a shape aesthetic to geom_pointrange(), and map it to LearnerType.
scores<-ggplot(data=performance, aes(x=Generation, y=meanScore,colour=LearnerType, shape=LearnerType))+
geom_hline(yintercept=0.65, linetype="dashed")+
geom_line()+
#change geom_errorbar to geom_pointrange
geom_pointrange(aes(ymin=meanScore-se,ymax=meanScore+se, shape=LearnerType))+
annotate("text",label="Chance\nPerformance",x=2,y=0.65)+
scale_y_continuous(limits=c(0.6,1), breaks=c(0.6,0.8,1.0),name="Proportion Successful Trials")+
scale_x_continuous(limits=c(1,10), breaks=c(1,2,3,4,5,6,7,8,9,10), name="Generation (Time)")+
scale_colour_discrete(name="Learner Type", labels=c("Baboon","Human Child"))
scores
This has done something slightly undesirable: it’s given us a separate legend for colour and shape. This is because, towards the end of the code block, we’ve customised the colour scale (its name and labels) but left the shape scale at its defaults; since the two scales no longer match, ggplot has created two different legends. Make scale_shape_discrete() identical to scale_colour_discrete(), and ggplot will automatically merge them:
scores<-ggplot(data=performance, aes(x=Generation, y=meanScore,colour=LearnerType, shape=LearnerType))+
geom_hline(yintercept=0.65, linetype="dashed")+
geom_line()+
#change geom_errorbar to geom_pointrange
geom_pointrange(aes(ymin=meanScore-se,ymax=meanScore+se, shape=LearnerType))+
annotate("text",label="Chance\nPerformance",x=2,y=0.65)+
scale_y_continuous(limits=c(0.6,1), breaks=c(0.6,0.8,1.0),name="Proportion Successful Trials")+
scale_x_continuous(limits=c(1,10), breaks=c(1,2,3,4,5,6,7,8,9,10), name="Generation (Time)")+
scale_colour_discrete(name="Learner Type", labels=c("Baboon","Human Child"))+
scale_shape_discrete(name="Learner Type", labels=c("Baboon","Human Child"))
scores
Finally, the legend hanging way out on the right is making this plot a lot wider than it needs to be, which is distracting. Let’s move it into the plot itself, to the lower right where there’s not much going on. Google change legend position ggplot to figure out how to do this - this will give you a preview of the theme() function, which we’ll look at shortly. Note that while annotate() used the x,y values on the plot determined by your data, all positioning within theme() uses x,y coordinates between 0 and 1, where the bottom left of the plot is 0,0 and the top right is 1,1. We’ll be creating a custom theme that you can keep and apply to all your plots.
scores<-ggplot(data=performance, aes(x=Generation, y=meanScore,colour=LearnerType, shape=LearnerType))+
geom_hline(yintercept=0.65, linetype="dashed")+
geom_line()+
#change geom_errorbar to geom_pointrange
geom_pointrange(aes(ymin=meanScore-se,ymax=meanScore+se, shape=LearnerType))+
annotate("text",label="Chance\nPerformance",x=1.5,y=0.65)+
scale_y_continuous(limits=c(0.6,1), breaks=c(0.6,0.8,1.0),name="Proportion Successful Trials")+
scale_x_continuous(limits=c(1,10), breaks=c(1,2,3,4,5,6,7,8,9,10), name="Generation (Time)")+
scale_colour_discrete(name="Learner Type", labels=c("Baboon","Human Child"))+
scale_shape_discrete(name="Learner Type", labels=c("Baboon","Human Child"))+
theme(legend.position=c(0.8,0.3))
scores
Once you’ve had a go, and before moving on to colour and shape, think briefly about why we wouldn’t want to put legend.position into a custom theme that we’d apply generally to all our plots, even if we want all our legends to be inset within our plots6
Now let’s think about our colours and shapes. The colours and shapes ggplot has used to denote baboons vs children are the defaults. You probably don’t want to keep these for a few reasons: the default palette isn’t especially colourblind-friendly, it doesn’t survive greyscale printing well, and it won’t match the published figures.
Luckily, all of this can be done by messing with our scales, supplying specific values. However, because we’re manually overriding some basic defaults, we now need to change from scale_whatever_discrete() to scale_whatever_manual(). Alter the code below to change the discrete scales to manual scales, and add values.
scores<-ggplot(data=performance, aes(x=Generation, y=meanScore,colour=LearnerType, shape=LearnerType))+
geom_hline(yintercept=0.65, linetype="dashed")+
geom_line()+
#change geom_errorbar to geom_pointrange
geom_pointrange(aes(ymin=meanScore-se,ymax=meanScore+se, shape=LearnerType))+
annotate("text",label="Chance\nPerformance",x=2,y=0.65)+
scale_y_continuous(limits=c(0.6,1), breaks=c(0.6,0.8,1.0),name="Proportion Successful Trials")+
scale_x_continuous(limits=c(1,10), breaks=c(1,2,3,4,5,6,7,8,9,10), name="Generation (Time)")+
scale_colour_manual(name="Learner Type", labels=c("Baboon","Human Child"), values=c("#5B5F97","#A5BE00"))+
scale_shape_manual(name="Learner Type", labels=c("Baboon","Human Child"), values=c(0,2))+
theme(legend.position=c(0.8,0.3))
scores
We’ve done a fair bit to this plot, but it still looks generic, and very unlike the published plots. The remaining aesthetics - font, background colour, etc. - are related to the
theme() rather than the scales. In the next section, we’ll make a custom theme for our plots.
Start out by looking at the default themes available in ggplot2 - you can simply add these to any plot to make fairly immediate changes. My favourite is theme_bw(), illustrated below, but I encourage you to try others.
scores+theme_bw()
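If you want to compare a few of ggplot2’s other built-in complete themes, you can add them to the plot in exactly the same way (these are all standard ggplot2 functions):

```r
# A few of ggplot2's other built-in complete themes, added the same way:
scores + theme_minimal()   # no background panel, minimal chrome
scores + theme_classic()   # traditional look: axis lines, no gridlines
scores + theme_light()     # light grey gridlines and panel border
```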
Note, however, that this has messed with our legend: theme_bw() is a complete theme, so adding it constitutes a second call to theme(), overriding the first. We need to set a custom theme globally, and then the position called within a later theme() call will apply in addition to it. Below is my preferred theme - notice that it elaborates upon
theme_minimal(). Also note that theme_set() overrides existing theme pre-sets, and then the existing call to theme() within scores adds the legend in the correct position.
theme_set(theme_minimal()+theme(text = element_text(family = "Times",size=15),plot.title = element_text(hjust = 0.5)))
scores
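As an alternative to changing the global default with theme_set(), you can store the same tweaks in an object and add it per plot. A sketch below - myTheme is just an illustrative name:

```r
# Keep the custom theme as an object instead of setting it globally.
# myTheme is a hypothetical name for illustration.
myTheme <- theme_minimal() +
  theme(text = element_text(family = "Times", size = 15),
        plot.title = element_text(hjust = 0.5))

# Adding a complete theme resets legend.position, so re-add it afterwards:
scores + myTheme + theme(legend.position = c(0.8, 0.3))
```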
Fiddle with this to create your own theme, and then look at the predictability and tetrominoes plots again. Note that you can set your plotting theme at the start of your R session, or in the first code chunk of an RMarkdown document where you generally load packages - it will then apply to all the plots you generate during the session. Now, move on to applying this elsewhere, and debugging one final issue:
Edit the predictability and tetrominoes plots in the same way we did the scores plot so that they have a matching aesthetic, particularly in terms of the scales.
tetrominoes<-ggplot(data=tetrominoProp, aes(x=Generation, y=tetProp,colour=LearnerType, shape=LearnerType))+
geom_hline(yintercept=0.0005, linetype="dashed")+
geom_line()+
geom_pointrange(aes(ymin=tetProp-se,ymax=tetProp+se, shape=LearnerType))+
annotate("text",label="Chance\nProportion",x=2,y=0.07)+
scale_y_continuous(limits=c(0,1), breaks=c(0,0.5,1.0),name="Proportion of Tetrominoes Produced")+
scale_x_continuous(limits=c(1,10), breaks=c(1,2,3,4,5,6,7,8,9,10), name="Generation (Time)")+
scale_colour_manual(name="Learner Type", labels=c("Baboon","Human Child"), values=c("#5B5F97","#A5BE00"))+
scale_shape_manual(name="Learner Type", labels=c("Baboon","Human Child"), values=c(0,2))+
theme(legend.position=c(0.8,0.3))
tetrominoes
predictability<-ggplot(data=meanEnt, aes(x=Generation, y=meanEntropy, colour=LearnerType))+
geom_line()+
geom_pointrange(aes(ymin=meanEntropy-se,ymax=meanEntropy+se,shape=LearnerType))+
ylab("Mean Entropy of Response Set")+
scale_x_continuous(limits=c(1,10), breaks=c(1,2,3,4,5,6,7,8,9,10), name="Generation (Time)")+
scale_colour_manual(name="Learner Type", labels=c("Baboon","Human Child"), values=c("#5B5F97","#A5BE00"))+
scale_shape_manual(name="Learner Type", labels=c("Baboon","Human Child"), values=c(0,2))+
theme(legend.position=c(0.5,0.8))
predictability
Network visualisation is an advanced topic, and one we won’t have much time to dig into. There are three reasons for this:
- Assuming at least some attendees are just starting out with ggplot, we probably don’t have time. However, if you’re already fairly advanced with this, you might blow through these exercises and get here on your own.
- Network visualisation is advanced. Eyeballing data like the proportions we’ve just looked at, or many other categorical or continuous variables, is inherently useful. This is not necessarily the case for networks: the larger a network is, the more useful it is to analyse, but the less useful it is to visualise. Large networks quickly get visually garbled - you’re often better off comparing, e.g., continuous attributes of nodes against their degree using a scatterplot.
- I haven’t personally used R for network visualisation very much; I’ve used d3.js, which is an entirely different animal (it’s based on JavaScript rather than R, and uses JSON-style data in lieu of spreadsheet-style dataframes. But it’s very useful to learn if you’re keen on visualisation).
Regardless, maybe we can make some headway on network visualisation in R together. One of the things I hope you take away from this workshop is that once you know some basics, you can google almost anything - there are tons of online resources for R. For example, this workshop by Katy Ognyanova looks to be a very detailed introduction to network visualisation from an expert. While teaching yourself on the internet might not make for the fastest progress, you will progress, and you’ll learn concepts more deeply by applying them to problems that are inherently meaningful to you rather than following arbitrary tutorials.
Below, I’ve started to fiddle with the igraph package (docs here) and the ggnetwork package, which allows you to draw networks using ggplot-style syntax. I’ve used data from Wild et al., 2019, which deals with the diffusion of sponge foraging in dolphins (some of its authors might be familiar to you from elsewhere in the workshop, including Sonja Wild and Will Hoppitt). The data is inside the networkData folder - take a closer look at the paper to learn more about it. Note, however, related to my point above about the utility of network visualisation: the paper doesn’t actually include any network graphs - this is probably because, with networks this large, they aren’t terribly useful. It’s nonetheless useful data for starting to learn how to visualise networks in R. If you have some of your own network data to play with, that will be even more useful.
Below, I’ve made a crack at looking at relatedness among sponge foragers, which, in and of itself, didn’t account for much diffusion of foraging strategies in Wild et al.’s findings. I’ve run up against some walls already, which I’ve noted - mainly, I need to think of ways to make the network smaller before it can be useful to look at. See if you can make progress, or use the other data files (particularly social, vertical, and horizontal relatedness) to check out other ways of looking at the network.
library(igraph)
##
## Attaching package: 'igraph'
## The following object is masked from 'package:DescTools':
##
## %c%
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following object is masked from 'package:tidyr':
##
## crossing
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(ggnetwork)
#load relatedness matrix
relatedness<-read_csv("networkData/relatedness.csv")
## Warning: Missing column names filled in: 'X1' [1]
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## X1 = col_character()
## )
## ℹ Use `spec()` for the full column specifications.
#load individual attributes
indVars<-read_csv("networkData/ILVs.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## id_individual = col_character(),
## Sex_1_0 = col_double(),
## Haplotype = col_character(),
## Number_sightings = col_double(),
## Av_water_depth = col_double(),
## Av_group_size = col_double(),
## Sponger = col_character(),
## Demons_sponging_forage = col_character(),
## Sp_Order_acquisition = col_double(),
## Mum_known = col_character(),
## Not_weaned = col_double()
## )
#filter out dolphins who had less than 7 sightings, these were excluded from analysis
indVars<-indVars %>%
filter(Number_sightings>7, Sponger=="yes")
#use a copy of the IDs as a vector so we can apply this to the matrix
validDolphins<-as.vector(indVars$id_individual)
rel<-relatedness %>%
#filter out rows that have dolphins with few sightings
filter(X1 %in% validDolphins) %>%
#select columns that are valid dolphins
select(one_of(validDolphins))
g<-graph_from_adjacency_matrix(as.matrix(rel),weighted="relcoef") %>%
set_vertex_attr("AvgGroupSize",value=indVars$Av_group_size)
#delete edges with very low relatedness?
#g<-delete_edges(g, which(E(g)$relcoef<0.1))
#above works, but not so well without also deleting nodes that have no edges after this point
# basic format for visualising the network
nettest<-ggplot(g, aes(x=x,y=y,xend=xend,yend=yend))+
geom_edges(aes(alpha=relcoef),curvature=0.1)+
geom_nodes(aes(size=AvgGroupSize), colour="blue")+
theme_blank()
nettest
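One possible way past the wall noted in the comments above is to prune weak edges and then drop the vertices left isolated, so they don’t clutter the layout. This is only a sketch - the 0.1 cutoff is arbitrary, and gSmall is just an illustrative name:

```r
# Prune edges below an (arbitrary) relatedness cutoff, then remove the
# vertices left with no edges so isolated points don't clutter the layout.
gSmall <- delete_edges(g, which(E(g)$relcoef < 0.1))
gSmall <- delete_vertices(gSmall, which(degree(gSmall) == 0))

# Re-draw the pruned network with the same ggnetwork recipe as above:
ggplot(gSmall, aes(x=x, y=y, xend=xend, yend=yend))+
  geom_edges(aes(alpha=relcoef), curvature=0.1)+
  geom_nodes(aes(size=AvgGroupSize), colour="blue")+
  theme_blank()
```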
read_csv() spits out some information about column specifications. Can you tell what this is all about? Why would you want it?↩︎
group_by() is an incredibly useful function, especially with summaries, but there is a conflict between dplyr, the tidyverse package underlying it, and another common data-manipulation package, plyr. If you’ve loaded plyr after dplyr, group_by() (and potentially some other dplyr functions) won’t work properly. Even if you don’t think you’re using plyr, it’s a dependency of many other R packages and might be loaded without you really knowing this is happening. The tidyverse is set to throw a warning when plyr is loaded, but you might miss it (and then be mystified as to why group_by() is failing to work), so keep it in mind. You may need to fiddle with the order in which you load packages, or detach packages before trying to use group_by().↩︎
In fact, R won’t allow this. If your variable names have spaces in them in e.g., an input .csv or Excel file, R will replace the spaces with a dot, such that e.g., “Learner Type” would become “Learner.Type”.↩︎
The reason is because otherwise it’s just drawing a dashed line over the solid line I put in the original plot, so we can’t see it. We need to remove that original line. To see this, first try doing scores+geom_hline(yintercept=0.65, linetype="dashed",colour="red") - you’ll be able to see it draw a dashed red line over the solid black one.↩︎
I think truncating the y-axis in the tetromino case is actually a bit suspect, and misrepresents the results a bit. The chance of randomly producing a tetromino is much lower, something like 0.002, so truncating the y-axis around 0.6 is a bit odd. What this does is make the variation in tetromino proportions look fairly intense, and quite disparate between children and baboons. However, when you look at our graph where we’ve added the baseline (and thus extended the y-axis range automatically, by asking ggplot to add something at a y-intercept of 0.002), there’s less of a distance between baboons and children, and it’s clearer that they’re both performing way above baseline consistently, and in fact, close to ceiling.↩︎
The reason not to put this in a general theme is because a) it won’t always be possible to put the legend within the plot; sometimes the plot is just full of data! and b) Even if it is possible to put the legend in the plot, where it will go will change from plot to plot, depending largely on where there might be a little space. Therefore, we don’t really want this to be part of a theme we apply to all our plots (like font, font size, axis label tilting, etc) because there is no one-position-fits-all solution here.↩︎